Video-based action recognition is one of the important and challenging problems in computer vision research. The Bag of Visual Words (BoVW) model with local features has become the most popular method and has obtained state-of-the-art performance on several realistic datasets, such as HMDB51, UCF50, and UCF101. BoVW is a general pipeline for constructing a global representation from a set of local features, mainly composed of five steps: (i) feature extraction, (ii) feature pre-processing, (iii) codebook generation, (iv) feature encoding, and (v) pooling and normalization. Many efforts have been made at each step independently in different scenarios, and their effects on action recognition are still unknown. Meanwhile, video data exhibits different views of visual patterns, such as static appearance and motion dynamics, and multiple descriptors are usually extracted to represent these different views. Many feature fusion methods have been developed in other areas, but their influence on action recognition has never been investigated before. This paper aims to provide a comprehensive study of all the steps in BoVW and of different fusion methods, and to uncover good practices for producing a state-of-the-art action recognition system. Specifically, we explore two kinds of local features, ten kinds of encoding methods, eight kinds of pooling and normalization strategies, and three kinds of fusion methods. We conclude that every step is crucial to the final recognition rate. Furthermore, based on our comprehensive study, we propose a simple yet effective representation, called the hybrid representation, by exploring the complementarity of different BoVW frameworks and local descriptors. Using this representation, we obtain state-of-the-art results on three challenging datasets: HMDB51 (61.1%), UCF50 (92.3%), and UCF101 (87.9%).
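To make the five-step pipeline concrete, here is a minimal sketch of the BoVW process using k-means codebooks with hard-assignment (vector quantization) encoding, sum pooling, and L2 normalization. The random descriptors, feature dimension, and codebook size are placeholders for illustration only; the paper's actual systems use local video descriptors (e.g. appearance and motion features) and the richer encodings studied in the text.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
K = 16  # codebook size (placeholder; real systems use far larger codebooks)

# (i) feature extraction -- stand-in: random 32-D local descriptors per video
videos = [rng.normal(size=(200, 32)) for _ in range(5)]

# (ii) feature pre-processing -- plain standardization here
#      (PCA-whitening is a common alternative)
all_feats = np.vstack(videos)
mean, std = all_feats.mean(axis=0), all_feats.std(axis=0) + 1e-8
videos = [(v - mean) / std for v in videos]

# (iii) codebook generation -- k-means over the pooled descriptors
codebook = KMeans(n_clusters=K, n_init=4, random_state=0).fit(np.vstack(videos))

def encode(video):
    # (iv) feature encoding -- hard assignment of each descriptor
    #      to its nearest codeword (vector quantization)
    words = codebook.predict(video)
    # (v) pooling and normalization -- sum-pool assignments into a
    #     histogram, then L2-normalize to a fixed-length representation
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-8)

reps = np.stack([encode(v) for v in videos])
print(reps.shape)  # one fixed-length global representation per video
```

Each video, regardless of its number of local descriptors, is mapped to a single K-dimensional vector suitable for a standard classifier such as a linear or kernel SVM.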